DDSAnalytics is an analytics company that specializes in talent management solutions for Fortune 100 companies. Talent management is defined as the iterative process of developing and retaining employees. The executive leadership has identified predicting employee turnover as its first application of data science for talent management. We have been tasked to conduct an analysis of existing employee data.
We evaluated the employee data, conducted in-depth analysis of various parameters provided to us. We came up with various factors which impacts the employee attrition and as well as factors which can help predict the attrition of an employee. We also built a profile of an employee with high probability of leaving the company.
First we will see a sample of our data to understand what kind of variables we are evaluating.
## ID Age Attrition BusinessTravel DailyRate Department
## 1 1 32 No Travel_Rarely 117 Sales
## 2 2 40 No Travel_Rarely 1308 Research & Development
## 3 3 35 No Travel_Frequently 200 Research & Development
## 4 4 32 No Travel_Rarely 801 Sales
## 5 5 24 No Travel_Frequently 567 Research & Development
## 6 6 27 No Travel_Frequently 294 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 13 4 Life Sciences 1 859
## 2 14 3 Medical 1 1128
## 3 18 2 Life Sciences 1 1412
## 4 1 4 Marketing 1 2016
## 5 2 1 Technical Degree 1 1646
## 6 10 2 Life Sciences 1 733
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 73 3 2
## 2 3 Male 44 2 5
## 3 3 Male 60 3 3
## 4 3 Female 48 3 3
## 5 1 Female 32 3 1
## 6 4 Male 32 3 3
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 Sales Executive 4 Divorced 4403
## 2 Research Director 3 Single 19626
## 3 Manufacturing Director 4 Single 9362
## 4 Sales Executive 4 Married 10422
## 5 Research Scientist 4 Single 3760
## 6 Manufacturing Director 1 Divorced 8793
## MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1 9250 2 Y No 11
## 2 17544 1 Y No 14
## 3 19944 2 Y No 11
## 4 24032 1 Y No 19
## 5 17218 1 Y Yes 13
## 6 4809 1 Y No 21
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## 1 3 3 80 1
## 2 3 1 80 0
## 3 3 3 80 0
## 4 3 3 80 2
## 5 3 3 80 0
## 6 4 3 80 2
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1 8 3 2 5
## 2 21 2 4 20
## 3 10 2 3 2
## 4 14 3 3 14
## 5 6 2 3 6
## 6 9 4 2 9
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1 2 0 3
## 2 7 4 9
## 3 2 2 2
## 4 10 5 7
## 5 3 1 3
## 6 7 1 7
Our Data set has following variables and records.
## [1] "Number of variables: 36"
## [1] "Number of Records: 870"
## [1] "List of Variables: "
## [1] "ID" "Age"
## [3] "Attrition" "BusinessTravel"
## [5] "DailyRate" "Department"
## [7] "DistanceFromHome" "Education"
## [9] "EducationField" "EmployeeCount"
## [11] "EmployeeNumber" "EnvironmentSatisfaction"
## [13] "Gender" "HourlyRate"
## [15] "JobInvolvement" "JobLevel"
## [17] "JobRole" "JobSatisfaction"
## [19] "MaritalStatus" "MonthlyIncome"
## [21] "MonthlyRate" "NumCompaniesWorked"
## [23] "Over18" "OverTime"
## [25] "PercentSalaryHike" "PerformanceRating"
## [27] "RelationshipSatisfaction" "StandardHours"
## [29] "StockOptionLevel" "TotalWorkingYears"
## [31] "TrainingTimesLastYear" "WorkLifeBalance"
## [33] "YearsAtCompany" "YearsInCurrentRole"
## [35] "YearsSinceLastPromotion" "YearsWithCurrManager"
We have to clean up the data now to be able to use if for our analysis. First we will set the correct data type for the variables.
Lets first check the distribution of Monthly Income and distribution of Attrition as they are the data of Interest here.
The Monthly Income looks right skewed. Which is expected of a Income data and its in line with our expectation.
The Attrition data appears to be exteremly unbalanced. We have large number of “No’s” compared to “Yes’s”. Which means the data analysis could suffer from bias.
Before we take any action to balance the data, lets evaluate little more.
Lets check if Monthly Income and Age has any relationship with Attrition.
It looks like the “No” attrition is overwhelming compared to the acutal attrition.
Let’s check one more set of plots. Do we have a relationship between Attrition and the Job Level, Job Role, Job Involvement and Job Satisfaction?
Again its hard to tell if Job Level plays any role at in the Attrition. Because the “No” values are disproportionately higher.
To solve this problem, we will balance the data using Down Sample feature. Where data will be reduced to match the Attrition values of Yes and No.
## [1] "Number of variables: 36"
## [1] "Number of Records: 280"
## [1] "List of Variables: "
## [1] "ID" "Age"
## [3] "Attrition" "BusinessTravel"
## [5] "DailyRate" "Department"
## [7] "DistanceFromHome" "Education"
## [9] "EducationField" "EmployeeCount"
## [11] "EmployeeNumber" "EnvironmentSatisfaction"
## [13] "Gender" "HourlyRate"
## [15] "JobInvolvement" "JobLevel"
## [17] "JobRole" "JobSatisfaction"
## [19] "MaritalStatus" "MonthlyIncome"
## [21] "MonthlyRate" "NumCompaniesWorked"
## [23] "Over18" "OverTime"
## [25] "PercentSalaryHike" "PerformanceRating"
## [27] "RelationshipSatisfaction" "StandardHours"
## [29] "StockOptionLevel" "TotalWorkingYears"
## [31] "TrainingTimesLastYear" "WorkLifeBalance"
## [33] "YearsAtCompany" "YearsInCurrentRole"
## [35] "YearsSinceLastPromotion" "YearsWithCurrManager"
Lets look at the Monthly Income and Age in relationship with Attrition again with balanced data.
We can now see a more clear pattern where Lower Income and Yonger Age has clear relationhip to higher Attrition.
Let’s check the relationship between Attrition and the Job Level, Job Role, Job Involvement and Job Satisfaction again.
We can see that Job Level 1, Job Role of Laboratory Technician and Research Scientists, Lower Job Involvement and Lower Job Satisfation leads to higher Attrition rate.
We will continue to do our data analysis using various parameters and plots to see if there is any relationship amoung the different variables with Monthly Income and Attrition.
From above plots we notice that Gender doesn’t play a role in terms of Attrition or Income. However, Stock Option, Over Time are very important factor in Attrition. Marital status also appears to play a role in Attrition.
To further our analysis and validate the relationship of variables with each other. We will do heatmaps to check how variables are correlated to each other.
Based on Exploratory Data Analysis we can say that a typical profile of Employee who is likely to leave a company is follows:
Surprisingly, following factors doesn’t have a bearing on the high Attrition rate:
After Exploratory Data Analysis, we will move on to Statistical Analysis of our Data.
First we will start with regression of Monthly Income.
To start with, we will remove unwanted variables like: 1. ID - ID shouldn’t matter 2. Employee Count - All values are 1 3. Employee Number - Employee number shouldn’t matter 4. Over 18 - All values are Yes 5. Standard Hours - All values are 80
Once we have the dataset, we will run stepwise regression to identify the significant variables
##
## Call:
## lm(formula = MonthlyIncome ~ BusinessTravel + DistanceFromHome +
## Gender + JobLevel + JobRole + TotalWorkingYears + YearsInCurrentRole,
## data = data.red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2786.5 -621.9 -40.7 546.7 3989.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2803.511 414.929 6.757 9.17e-11 ***
## BusinessTravelTravel_Frequently 446.795 242.976 1.839 0.06707 .
## BusinessTravelTravel_Rarely 848.644 208.736 4.066 6.35e-05 ***
## DistanceFromHome -21.175 7.427 -2.851 0.00470 **
## GenderMale 188.245 133.238 1.413 0.15889
## JobLevel2 1935.318 272.643 7.098 1.19e-11 ***
## JobLevel3 4746.914 360.499 13.168 < 2e-16 ***
## JobLevel4 8349.739 535.478 15.593 < 2e-16 ***
## JobLevel5 10876.864 612.231 17.766 < 2e-16 ***
## JobRoleHuman Resources -757.796 470.934 -1.609 0.10880
## JobRoleLaboratory Technician -1115.079 342.555 -3.255 0.00128 **
## JobRoleManager 3753.452 455.718 8.236 8.62e-15 ***
## JobRoleManufacturing Director 258.640 333.539 0.775 0.43878
## JobRoleResearch Director 3938.158 403.869 9.751 < 2e-16 ***
## JobRoleResearch Scientist -803.932 346.872 -2.318 0.02124 *
## JobRoleSales Executive 364.975 255.826 1.427 0.15487
## JobRoleSales Representative -1063.900 382.812 -2.779 0.00585 **
## TotalWorkingYears 31.151 15.624 1.994 0.04722 *
## YearsInCurrentRole 35.783 22.207 1.611 0.10830
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1025 on 261 degrees of freedom
## Multiple R-squared: 0.9524, Adjusted R-squared: 0.9491
## F-statistic: 290.1 on 18 and 261 DF, p-value: < 2.2e-16
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 3407.683 0.5067651 2676.7319 721.0088 0.16867373 481.0300
## 2 2 2942.690 0.5588824 2299.3641 686.4248 0.20866856 405.9648
## 3 3 1831.147 0.8320876 1516.7675 184.5672 0.05348063 182.9415
## 4 4 1300.565 0.9162584 960.7193 224.2638 0.03249893 156.1966
## 5 5 1243.909 0.9271295 933.5867 137.3650 0.02324789 100.6893
## 6 6 1112.919 0.9380030 827.0537 186.0700 0.02286253 134.2642
## 7 7 1081.871 0.9405316 805.2130 193.7222 0.02391327 129.3790
## 8 8 1093.060 0.9401301 821.8018 179.2924 0.02223264 123.2936
## 9 9 1080.926 0.9413322 810.9890 183.5284 0.02284979 123.7845
## 10 10 1073.305 0.9412974 804.6415 194.8817 0.02440464 127.6605
## 11 11 1068.352 0.9416120 798.9384 186.5494 0.02356698 119.6480
## 12 12 1078.962 0.9401214 813.9811 200.8309 0.02528910 137.1344
## 13 13 1065.720 0.9419299 807.3506 194.5372 0.02464378 139.9376
## 14 14 1086.219 0.9400449 825.5309 188.3027 0.02509340 145.9595
## 15 15 1100.238 0.9388400 838.3119 183.7699 0.02544104 145.7394
##
## Call:
## lm(formula = MonthlyIncome ~ BusinessTravel + DistanceFromHome +
## JobLevel + JobRole + TotalWorkingYears, data = data.red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3103.6 -615.1 -62.6 548.1 3772.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2943.693 403.992 7.287 3.71e-12 ***
## BusinessTravelTravel_Frequently 402.277 242.793 1.657 0.098738 .
## BusinessTravelTravel_Rarely 797.181 207.626 3.840 0.000155 ***
## DistanceFromHome -22.647 7.417 -3.053 0.002495 **
## JobLevel2 1982.965 270.742 7.324 2.94e-12 ***
## JobLevel3 4840.400 355.287 13.624 < 2e-16 ***
## JobLevel4 8392.418 536.238 15.651 < 2e-16 ***
## JobLevel5 10924.085 614.004 17.792 < 2e-16 ***
## JobRoleHuman Resources -697.920 471.746 -1.479 0.140219
## JobRoleLaboratory Technician -1027.635 341.026 -3.013 0.002836 **
## JobRoleManager 3695.383 456.280 8.099 2.08e-14 ***
## JobRoleManufacturing Director 275.794 333.423 0.827 0.408897
## JobRoleResearch Director 3866.907 403.800 9.576 < 2e-16 ***
## JobRoleResearch Scientist -745.963 346.487 -2.153 0.032233 *
## JobRoleSales Executive 403.523 255.766 1.578 0.115835
## JobRoleSales Representative -1012.452 382.174 -2.649 0.008557 **
## TotalWorkingYears 41.460 14.492 2.861 0.004565 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1029 on 263 degrees of freedom
## Multiple R-squared: 0.9517, Adjusted R-squared: 0.9487
## F-statistic: 323.7 on 16 and 263 DF, p-value: < 2.2e-16
Based on the regression analysis, we identify the following variables to have significant impact on the Monthly Income:
We will use these variables for our prediction.
Let’s first validate the model for Monthly Income in terms of Error (Root Mean Square Error)
## [1] "Root Mean Square Error: 1162.83567849355"
###Attrition
Now, lets run the statistical Analysis for Attrition.
We will use two different approaches for analysia and prediction of Attrition:
To start with, we will run Logistic Regression to identify the significant variables that contribute towards Attrition.
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = Attrition ~ ., family = binomial(link = "logit"),
## data = data.red)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3656 -0.2508 0.0000 0.2640 3.5295
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.413e+00 2.186e+03 -0.002 0.99839
## Age -2.171e-02 3.927e-02 -0.553 0.58029
## BusinessTravelTravel_Frequently 3.450e-01 1.229e+00 0.281 0.77899
## BusinessTravelTravel_Rarely 3.908e-02 1.126e+00 0.035 0.97232
## DailyRate 1.678e-04 7.516e-04 0.223 0.82338
## DepartmentResearch & Development 4.006e+01 2.458e+03 0.016 0.98700
## DepartmentSales 3.969e+01 2.458e+03 0.016 0.98712
## DistanceFromHome 1.195e-01 4.441e-02 2.690 0.00714 **
## Education2 8.002e-01 1.022e+00 0.783 0.43364
## Education3 1.960e+00 8.957e-01 2.188 0.02865 *
## Education4 9.234e-01 9.249e-01 0.998 0.31810
## Education5 -4.343e+00 1.770e+00 -2.453 0.01415 *
## EducationFieldLife Sciences -2.437e+01 1.124e+03 -0.022 0.98270
## EducationFieldMarketing -2.504e+01 1.124e+03 -0.022 0.98223
## EducationFieldMedical -2.434e+01 1.124e+03 -0.022 0.98273
## EducationFieldOther -2.389e+01 1.124e+03 -0.021 0.98304
## EducationFieldTechnical Degree -2.353e+01 1.124e+03 -0.021 0.98330
## EnvironmentSatisfaction2 -2.351e+00 1.068e+00 -2.201 0.02772 *
## EnvironmentSatisfaction3 -6.505e-01 8.794e-01 -0.740 0.45948
## EnvironmentSatisfaction4 -1.400e+00 8.278e-01 -1.691 0.09081 .
## GenderMale -1.048e+00 7.415e-01 -1.413 0.15773
## HourlyRate 1.289e-02 1.631e-02 0.791 0.42910
## JobInvolvement2 -1.332e+00 1.408e+00 -0.946 0.34410
## JobInvolvement3 -3.745e+00 1.312e+00 -2.855 0.00431 **
## JobInvolvement4 -5.266e+00 1.640e+00 -3.211 0.00132 **
## JobLevel2 -2.901e+00 1.496e+00 -1.939 0.05245 .
## JobLevel3 -2.554e+00 2.364e+00 -1.080 0.28013
## JobLevel4 -3.542e+00 3.584e+00 -0.988 0.32305
## JobLevel5 2.602e+00 4.542e+00 0.573 0.56674
## JobRoleHuman Resources 3.527e+01 2.458e+03 0.014 0.98855
## JobRoleLaboratory Technician -2.616e+00 1.871e+00 -1.398 0.16205
## JobRoleManager -1.626e+00 2.929e+00 -0.555 0.57894
## JobRoleManufacturing Director -3.704e+00 2.104e+00 -1.761 0.07831 .
## JobRoleResearch Director -6.273e+00 3.088e+00 -2.032 0.04220 *
## JobRoleResearch Scientist -4.210e+00 1.881e+00 -2.238 0.02520 *
## JobRoleSales Executive 5.940e-01 3.279e+00 0.181 0.85622
## JobRoleSales Representative 2.080e-02 3.475e+00 0.006 0.99522
## JobSatisfaction2 -2.114e+00 1.039e+00 -2.035 0.04182 *
## JobSatisfaction3 -1.158e+00 8.886e-01 -1.303 0.19243
## JobSatisfaction4 -2.657e+00 9.841e-01 -2.700 0.00694 **
## MaritalStatusMarried 1.009e+00 1.242e+00 0.813 0.41629
## MaritalStatusSingle -1.758e-01 1.600e+00 -0.110 0.91255
## MonthlyIncome -2.927e-04 2.870e-04 -1.020 0.30778
## MonthlyRate -1.042e-05 4.367e-05 -0.239 0.81144
## NumCompaniesWorked1 2.530e+00 1.046e+00 2.418 0.01560 *
## NumCompaniesWorked2 2.719e+00 1.497e+00 1.816 0.06932 .
## NumCompaniesWorked3 1.901e+00 1.621e+00 1.173 0.24080
## NumCompaniesWorked4 2.363e+00 1.604e+00 1.474 0.14060
## NumCompaniesWorked5 3.867e+00 1.759e+00 2.198 0.02794 *
## NumCompaniesWorked6 5.830e+00 2.317e+00 2.516 0.01186 *
## NumCompaniesWorked7 1.038e+00 1.548e+00 0.671 0.50234
## NumCompaniesWorked8 2.111e+00 1.726e+00 1.223 0.22138
## NumCompaniesWorked9 5.996e+00 2.551e+00 2.350 0.01875 *
## OverTimeYes 2.800e+00 7.013e-01 3.993 6.54e-05 ***
## PercentSalaryHike -1.011e-01 1.353e-01 -0.748 0.45459
## PerformanceRating4 4.838e-01 1.337e+00 0.362 0.71740
## RelationshipSatisfaction2 -3.772e+00 1.187e+00 -3.177 0.00149 **
## RelationshipSatisfaction3 -2.534e+00 9.873e-01 -2.567 0.01027 *
## RelationshipSatisfaction4 -1.245e+00 8.737e-01 -1.425 0.15410
## StockOptionLevel1 -3.364e+00 1.183e+00 -2.843 0.00447 **
## StockOptionLevel2 -6.063e+00 1.965e+00 -3.086 0.00203 **
## StockOptionLevel3 2.458e+00 1.932e+00 1.273 0.20312
## TotalWorkingYears 1.906e-02 1.048e-01 0.182 0.85572
## TrainingTimesLastYear -7.689e-02 2.383e-01 -0.323 0.74693
## WorkLifeBalance2 -1.579e+00 1.376e+00 -1.148 0.25107
## WorkLifeBalance3 -2.086e+00 1.351e+00 -1.544 0.12270
## WorkLifeBalance4 -2.083e+00 1.484e+00 -1.403 0.16062
## YearsAtCompany 2.111e-01 1.559e-01 1.354 0.17579
## YearsInCurrentRole -2.927e-01 2.117e-01 -1.383 0.16672
## YearsSinceLastPromotion 2.957e-01 1.429e-01 2.069 0.03857 *
## YearsWithCurrManager -2.527e-01 1.942e-01 -1.301 0.19321
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.16 on 279 degrees of freedom
## Residual deviance: 135.65 on 209 degrees of freedom
## AIC: 277.65
##
## Number of Fisher Scoring iterations: 16
Based on the significant variables identified in initial run of Logistic Regression model. Make a subset of the variables from statistical analysis and EDA.
Now, Let’s validate the model.
## [1] "Accuracy = 71.4285714285714 %"
## [1] "Sensitivity = 73.3333333333333 %"
## [1] "Specificity = 69.2307692307692 %"
The Accuracy, Sensitivity and Specificity looks good. Now lets try and run KNN and check the prediction accuracy.
In order to use categorical variables for KNN prediction, first we need to convert them into Integers. Once we have all numeric and integer variables, we will standardize (normalize) then for prediction. To remove the influence of a parameter being in larger scale.
First we will identify the best value of K based on the Accuracy. By running 100 iterations.
## [1] "Best Accuracy at K = 19 of 70.2380952380952 %"
## [1] "Mean Accuracy = 65.7857142857143 %"
## [1] "Mean Sensitivity = 82.7111111111111 %"
## [1] "Mean Specificity = 46.2564102564103 %"
Since iteration recommended K=19. We will run KNN for K=19.
## [1] "Accuracy for K-19 = 70.2380952380952 %"
## [1] "Sensitivity for K-19 = 84.4444444444444 %"
## [1] "Specificity for K-19 = 53.8461538461538 %"
For K=19, we got good Accuracy (above threshold of 60%). However, the Specificity is comparatively lower and marginally under 60%.
Comparing the results from KNN and Logistic Regression. We decide to go ahead with Logistic Regression Model.